Identifiers in Freebase
... how they look from RDF
Prelude
Most Linked Data sets represent links to the outside world in a format like
:internalResource owl:sameAs external:thatResource .
where owl:sameAs
could be replaced by some other predicate whose definition is less problematic. When data is linked this way, you have many options for integration, such as loading everything into the same triple store or dereferencing URIs one at a time.
Freebase was conceived before the time of Linked Data and SPARQL, so it developed its own method of mapping identifiers to concepts; this information is expressed in two different ways in RDF.
For the purpose of concision, the :BaseKB Compact Edition supports only one of these mechanisms, the :type.object.key
predicate, while the :BaseKB Complete Edition supports both.
This article teaches you how to look up internal and external identifiers using the :type.object.key
predicate and the special key:
namespace.
Inventor of the Traffic Light
Let's take the case of Garret Morgan, who is :m.01tp2v in Freebase and who comes about as close to a real-life Tony Stark as anyone. If we look up the identifiers that Freebase knows for him with this query
prefix : <http://rdf.basekb.com/ns/>
select ?key {
:m.01tp2v :type.object.key ?key .
} ORDER BY ?key
we get
Note that Freebase keys are structured like paths in Unix. :type.object.key
spells them out completely, while the alternative representation represents the directed acyclic graph directly.
Some of these identifiers were inserted by external entities (e.g. /base/ranker/
and /user/avh/ellerdale
). We also see a key in the /en/
namespace, which means you can refer to this entity as /en/garret_a_morgan
in MQL queries. In the early days, Freebase created human-readable identifiers for all topics, but this policy did not scale well, and Freebase eventually converged on the consistent use of mids for everything that is not a type or a property.
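Because keys are structured like Unix paths, a key can be taken apart into the namespace it lives in and its final segment. The following is a minimal sketch; the class and method names are hypothetical (they are not part of Freebase or :BaseKB), and it assumes keys begin with a slash and carry no trailing slash.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that treats a Freebase key as a Unix-style path. */
final class FreebaseKeyPath {
    private FreebaseKeyPath() {}

    /** Everything before the last slash, e.g. "/en" for "/en/garret_a_morgan". */
    static String namespaceOf(String key) {
        int lastSlash = key.lastIndexOf('/');
        // A key such as "/foo" sits directly under the root namespace "/".
        return lastSlash <= 0 ? "/" : key.substring(0, lastSlash);
    }

    /** Everything after the last slash. */
    static String localNameOf(String key) {
        return key.substring(key.lastIndexOf('/') + 1);
    }

    /** All path segments, without the empty one before the leading slash. */
    static List<String> segmentsOf(String key) {
        String trimmed = key.startsWith("/") ? key.substring(1) : key;
        return Arrays.asList(trimmed.split("/"));
    }

    public static void main(String[] args) {
        String key = "/en/garret_a_morgan";
        System.out.println(namespaceOf(key));
        System.out.println(localNameOf(key));
        System.out.println(segmentsOf(key));
    }
}
```

For /en/garret_a_morgan this yields the namespace /en and the local name garret_a_morgan.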
Unicode character encoding in keys
An important bit of convention is that Freebase encodes non-plaintext characters in identifiers as $xxxx
where xxxx
is hexadecimal for a 16-bit Unicode codepoint. You can see this used above, where "Garret A. Morgan" is spelled out as
Garret_A$002E_Morgan
Hexadecimal 2E
is decimal 46 in ASCII and Unicode, which represents a period. The same encoding is used for the Korean variant "개릿 모건", which is Morgan's name spelled out phonetically:
/wikipedia/ko/$AC1C$B9BF_$BAA8$AC74
Note that characters outside the Basic Multilingual Plane (with code points greater than $FFFF
) are encoded as a pair of $xxxx groups, one for each UTF-16 surrogate. The following Java function decodes the $
sequences in Freebase keys:
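// Decodes the $xxxx escape sequences in a Freebase key.
public String unescapeFreebaseKey(String in) {
    StringBuilder out = new StringBuilder(in.length());
    // The limit of -1 keeps trailing empty parts, so a dangling '$' is detected below.
    String[] parts = in.split("[$]", -1);
    // Everything before the first '$' is literal text.
    out.append(parts[0]);
    for (int i = 1; i < parts.length; i++) {
        // Each remaining part must begin with four hexadecimal digits.
        if (parts[i].length() < 4) {
            throw new IllegalArgumentException("Truncated $ escape in key: " + in);
        }
        // Integer.parseInt throws NumberFormatException if these are not hex digits.
        int codePoint = Integer.parseInt(parts[i].substring(0, 4), 16);
        // Surrogate halves are appended one after another, so pairs reassemble on their own.
        out.append(Character.toChars(codePoint));
        // Whatever follows the four hex digits is literal text.
        out.append(parts[i].substring(4));
    }
    return out.toString();
}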
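The opposite direction, turning plain text into a key segment, can be sketched in the same spirit. This is an illustration rather than Freebase's own implementation: the class name is hypothetical, and the set of characters left unescaped (ASCII letters, digits, underscore and hyphen) is an assumption.

```java
/** Hypothetical inverse of the decoder; the unescaped character set is an assumption. */
final class FreebaseKeyEscaper {
    private FreebaseKeyEscaper() {}

    /** Assumption: ASCII letters, digits, '_' and '-' pass through unchanged. */
    private static boolean isPlain(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9') || c == '_' || c == '-';
    }

    static String escape(String in) {
        StringBuilder out = new StringBuilder(in.length());
        // Walk over UTF-16 code units, so a supplementary character
        // becomes two $xxxx groups, one per surrogate.
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (isPlain(c)) {
                out.append(c);
            } else {
                out.append(String.format("$%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Garret_A._Morgan"));
        // U+1F600 lies above $FFFF and is written as a surrogate pair.
        System.out.println(escape(new String(Character.toChars(0x1F600))));
    }
}
```

With these assumptions, "Garret_A._Morgan" becomes Garret_A$002E_Morgan, and the single code point U+1F600 becomes $D83D$DE00.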
The key(s) to Wikipedia
Let's take a look at the conventions used in Wikipedia keys. Wikipedia keys come in several kinds:
/wikipedia/{lang}/
/wikipedia/{lang}_title/
/wikipedia/{lang}_id/
where {lang}
is an ISO 639-1 code or a variation of one. Wikipedia keys are derived from Wikipedia titles by replacing the space character with an underscore and escaping punctuation and non-ASCII characters with the $
-convention described above.
A page in Wikipedia has a "real" title, but may appear under different names because of redirect records that point to the real page. The real title is encoded in the /wikipedia/{lang}_title/
namespaces, whereas the titles that redirect to the real title are encoded in the /wikipedia/{lang}/
namespaces.
Generally, systems should accept any Wikipedia title coming from an outside system, but should use the official form when exporting data to the outside.
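The distinction between the three kinds of Wikipedia key can be sketched as a small classifier plus a helper that picks the official title for export. All names here are hypothetical, and the sketch assumes that the namespace suffixes _title and _id are the only markers needed to tell the kinds apart.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical helper; the names are illustrative and not taken from Freebase or :BaseKB. */
final class WikipediaKeys {
    private WikipediaKeys() {}

    enum Kind { ANY_TITLE, CANONICAL_TITLE, PAGE_ID }

    /** Classifies a Wikipedia key; returns empty for keys outside /wikipedia/. */
    static Optional<Kind> kindOf(String key) {
        // Expected parts: "", "wikipedia", the language namespace, the value.
        String[] parts = key.split("/", 4);
        if (parts.length < 4 || !parts[1].equals("wikipedia")) {
            return Optional.empty();
        }
        String namespace = parts[2];
        if (namespace.endsWith("_title")) {
            return Optional.of(Kind.CANONICAL_TITLE);
        }
        if (namespace.endsWith("_id")) {
            return Optional.of(Kind.PAGE_ID);
        }
        return Optional.of(Kind.ANY_TITLE);
    }

    /** Picks the key that holds the official title for a language, if one is present. */
    static Optional<String> canonicalTitleKey(List<String> keys, String lang) {
        String prefix = "/wikipedia/" + lang + "_title/";
        return keys.stream().filter(k -> k.startsWith(prefix)).findFirst();
    }
}
```

A system built this way would accept any ANY_TITLE or CANONICAL_TITLE key as input, and call canonicalTitleKey when it needs the form to show to the outside.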
Wikipedia titles have the special property of being unique, unlike Freebase titles, which can be shared by many objects. Wikipedia titles are disambiguated in a rather ad hoc manner. Sometimes Wikipedians choose names to avoid conflict, but frequently they add something to the title to disambiguate it, such as a few words in parentheses giving the type of the object, for example
Note that the /wikipedia/{lang}_id/
namespace contains numeric identifiers, which are the internal primary key in the database tables behind Wikipedia. These identifiers are supposed to remain stable when titles change, so they provide one more interconnection between Freebase and Wikipedia.
Keys in the complete edition
The Compact Edition of :BaseKB contains only the :type.object.key
identifiers. I believe these are sufficient for almost any task, but the Complete Edition provides a different view of Freebase keys. It turns out that any Freebase namespace, like
/authority/iso/3166-1/alpha-2
can be converted to a URI
<http://rdf.freebase.com/key/key.authority.iso.3166-1.alpha-2>
and Freebase uses this as a predicate like so
?subject ?keyPredicate "String_Value_Of_Key".
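The conversion from namespace to predicate URI can be sketched as a simple string transformation: drop the leading slash and turn the remaining slashes into dots. The class and method names are hypothetical, and the base URI is passed in as a parameter because this sketch does not assume which host a given edition uses.

```java
/** Hypothetical helper; the base URI is a parameter because its exact value is an assumption. */
final class KeyPredicates {
    private KeyPredicates() {}

    /** Maps a namespace such as "/authority/iso" onto baseUri + "authority.iso". */
    static String predicateUri(String baseUri, String namespace) {
        if (!namespace.startsWith("/") || namespace.length() < 2) {
            throw new IllegalArgumentException("Expected a namespace like /authority/iso: " + namespace);
        }
        // Tolerate a trailing slash, as in "/en/".
        String trimmed = namespace.endsWith("/")
                ? namespace.substring(0, namespace.length() - 1)
                : namespace;
        // Drop the leading slash, then replace the remaining slashes with dots.
        return baseUri + trimmed.substring(1).replace('/', '.');
    }

    public static void main(String[] args) {
        System.out.println(predicateUri(
                "http://rdf.freebase.com/key/key.",
                "/authority/iso/3166-1/alpha-2"));
    }
}
```

Called with the base http://rdf.freebase.com/key/key. and the namespace /authority/iso/3166-1/alpha-2, this reproduces the URI shown above.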
In this case, ISO 3166-1 Alpha 2 is the fancy name for the commonly used two-letter country abbreviations, and by searching this namespace, we can make a list of current countries, together with their codes and labels.
prefix : <http://rdf.basekb.com/ns/>
prefix key: <http://rdf.basekb.com/key/key.>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?country ?code ?label {
  ?country key:authority.iso.3166-1.alpha-2 ?code .
  ?country rdfs:label ?label .
  FILTER(lang(?label)='en')
}
The first few results look like
Conclusion
Freebase has a mechanism for representing internal or external identifiers that is expressed in two different ways. When you learn how to use this mechanism, you'll find it easy to link up Freebase with other data sources.